Scenerio¶

Imagine there's a bank that wants to segment their customers into clusters based on repayment potential. Kind of like a credit score, but not manually calculated.

We are using data from the Lending Club, a US peer-to-peer lending company. Borrowers on the platform can borrow between $1000 to $40,000 in the form of unsecured personal loans, for a term of either three or five years.

Investors can browse the loan applications and choose to finance the loans based on the credit history of the borrower, the amount of the loan, the loan grade, and the purpose of the loan. Investors earn money through interest paid on the loans, and Lending Club makes money from loan origination fees and service charges.

The timeframe of the data is from 2019 and is publicly available on the Lending Club Website (must make an account). A data dictionary is also available.

Data¶


Data Import¶

In [2]:
files = glob.glob(os.path.join('data', '*.zip'))
data = pd.concat((pd.read_csv(file, compression='zip', header=1, low_memory=False) for file in files), ignore_index=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 518115 entries, 0 to 518114
Columns: 150 entries, id to settlement_term
dtypes: float64(112), object(38)
memory usage: 592.9+ MB
In [3]:
data.shape
Out[3]:
(518115, 150)

Column Filtering¶

In [4]:
data.columns
Out[4]:
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'orig_projected_additional_accrued_interest',
       'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
       'debt_settlement_flag', 'debt_settlement_flag_date',
       'settlement_status', 'settlement_date', 'settlement_amount',
       'settlement_percentage', 'settlement_term'],
      dtype='object', length=150)

I don't really need all these columns, so we'll narrow them down.

In [5]:
print('old shape: ', data.shape)
# Select columns to keep
columnsToKeep = ['loan_amnt','funded_amnt','funded_amnt_inv','term', \
                 'int_rate','installment','grade','sub_grade', \
                 'emp_length','home_ownership','annual_inc', \
                 'verification_status','pymnt_plan','purpose', \
                 'addr_state','dti','delinq_2yrs','earliest_cr_line', \
                 'mths_since_last_delinq','mths_since_last_record', \
                 'open_acc','pub_rec','revol_bal','revol_util', \
                 'total_acc','initial_list_status','out_prncp', \
                 'out_prncp_inv','total_pymnt','total_pymnt_inv', \
                 'total_rec_prncp','total_rec_int','total_rec_late_fee', \
                 'recoveries','collection_recovery_fee','last_pymnt_d', \
                 'last_pymnt_amnt']

data = data.loc[:,columnsToKeep]
print('new shape: ', data.shape)
old shape:  (518115, 150)
new shape:  (518115, 37)
  • loan_amnt : The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
  • funded_amnt : The total amount committed to that loan at that point in time.
  • funded_amnt_inv : The total amount committed by investors for that loan at that point in time.
  • term : The number of payments on the loan. Values are in months and can be either 36 or 60.
  • int_rate : Interest Rate on the loan
  • installment : The monthly payment owed by the borrower if the loan originates.
  • grade : LC assigned loan grade
  • sub_grade : LC assigned loan subgrade
  • emp_length : Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
  • home_owndership : The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.
  • annual_inc : The self-reported annual income provided by the borrower during registration.
  • verification_status : The status of the loan during the listing period. Values: APPROVED, NOT_APPROVED.
  • pymnt_plan : Indicates if a payment plan has been put in place for the loan
  • purpose : A category provided by the borrower for the loan request.
  • addr_state : The state provided by the borrower in the loan application
  • dti : A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
  • delinq_2yrs : The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
  • earliest_cr_line : The date the borrower's earliest reported credit line was opened
  • mths_since_last_delinq : The number of months since the borrower's last delinquency.
  • mths_since_last_record : The number of months since the last public record.
  • open_acc : The number of open credit lines in the borrower's credit file.
  • pub_rec : Number of derogatory public records
  • revol_bal : Total credit revolving balance
  • revol_util : Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
  • total_acc : The total number of credit lines currently in the borrower's credit file
  • initial_list_status : The initial listing status of the loan. Possible values are – W, F
  • out_prncp : Remaining outstanding principal for total amount funded
  • out_prncp_inv : Remaining outstanding principal for portion of total amount funded by investors
  • total_pymnt : Payments received to date for total amount funded
  • total_pymnt_inv : Payments received to date for portion of total amount funded by investors
  • total_rec_prncp : Principal received to date
  • total_rec_int : Interest received to date
  • total_rec_late_fee : Late fees received to date
  • recoveries : post charge off gross recovery
  • collection_recovery_fee : post charge off collection fee
  • last_pymnt_d : Last month payment was received
  • last_pymnt_amnt : Last total payment amount received
In [6]:
data.head()
Out[6]:
loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_length home_ownership ... out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt
0 5000.0 5000.0 5000.0 36 months 13.90% 170.65 C C1 NaN OWN ... 3164.76 2552.03 2552.03 1835.24 716.79 0.0 0.0 0.0 Jun-2020 170.65
1 6500.0 6500.0 6475.0 36 months 8.81% 206.13 A A5 10+ years OWN ... 0.00 7015.28 6988.30 6500.00 515.28 0.0 0.0 0.0 Apr-2020 4754.21
2 24000.0 24000.0 24000.0 60 months 13.90% 557.20 C C1 10+ years MORTGAGE ... 20427.69 7206.53 7206.53 3572.31 3634.22 0.0 0.0 0.0 Jun-2020 557.20
3 18500.0 18500.0 18500.0 60 months 14.74% 437.60 C C2 NaN MORTGAGE ... 15121.23 6533.70 6533.70 3378.77 3154.93 0.0 0.0 0.0 Jul-2020 437.60
4 24000.0 24000.0 24000.0 60 months 20.00% 635.86 D D2 8 years MORTGAGE ... 20017.99 9484.57 9484.57 3982.01 5502.56 0.0 0.0 0.0 Jun-2020 635.86

5 rows × 37 columns

D-type and Nulls Cleaning¶

In [7]:
# Transform features from string to numeric
for i in ["term","int_rate","emp_length","revol_util"]:
    data.loc[:,i] = \
        data.loc[:,i].apply(lambda x: re.sub("[^0-9]", "", str(x)))
    data.loc[:,i] = pd.to_numeric(data.loc[:,i])

# Determine which features are numerical
numericalFeats = [x for x in data.columns if data[x].dtype != 'object']

nanCounter = np.isnan(data.loc[:,numericalFeats]).sum()
nanCounter
Out[7]:
loan_amnt                       8
funded_amnt                     8
funded_amnt_inv                 8
term                            8
int_rate                        8
installment                     8
emp_length                  46259
annual_inc                      8
dti                          1153
delinq_2yrs                     8
mths_since_last_delinq     292561
mths_since_last_record     462994
open_acc                        8
pub_rec                         8
revol_bal                       8
revol_util                    682
total_acc                       8
out_prncp                       8
out_prncp_inv                   8
total_pymnt                     8
total_pymnt_inv                 8
total_rec_prncp                 8
total_rec_int                   8
total_rec_late_fee              8
recoveries                      8
collection_recovery_fee         8
last_pymnt_amnt                 8
dtype: int64
In [8]:
# Impute NaNs with mean 
fillWithMean = ['loan_amnt','funded_amnt','funded_amnt_inv','term', \
                'int_rate','installment','emp_length','annual_inc',\
                'dti','open_acc','revol_bal','revol_util','total_acc',\
                'out_prncp','out_prncp_inv','total_pymnt', \
                'total_pymnt_inv','total_rec_prncp','total_rec_int', \
                'last_pymnt_amnt']

# Impute NaNs with zero
fillWithZero = ['delinq_2yrs','mths_since_last_delinq', \
                'mths_since_last_record','pub_rec','total_rec_late_fee', \
                'recoveries','collection_recovery_fee']

# Perform imputation
im = SimpleImputer(strategy='mean')   
data.loc[:,fillWithMean] = im.fit_transform(data[fillWithMean])

data.loc[:,fillWithZero] = data.loc[:,fillWithZero].fillna(value=0,axis=1)

# Check for NaNs one last time
nanCounter = np.isnan(data.loc[:,numericalFeats]).sum()
nanCounter
Out[8]:
loan_amnt                  0
funded_amnt                0
funded_amnt_inv            0
term                       0
int_rate                   0
installment                0
emp_length                 0
annual_inc                 0
dti                        0
delinq_2yrs                0
mths_since_last_delinq     0
mths_since_last_record     0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                 0
total_acc                  0
out_prncp                  0
out_prncp_inv              0
total_pymnt                0
total_pymnt_inv            0
total_rec_prncp            0
total_rec_int              0
total_rec_late_fee         0
recoveries                 0
collection_recovery_fee    0
last_pymnt_amnt            0
dtype: int64

Feature Engineering and More Cleaning¶

In [9]:
# Feature engineering
# Typical ratios between loan amount, revolving balance, payments, 
# and the borrower's annual income
data['installmentOverLoanAmnt'] = data.installment/data.loan_amnt
data['loanAmntOverIncome'] = data.loan_amnt/data.annual_inc
data['revol_balOverIncome'] = data.revol_bal/data.annual_inc
data['totalPymntOverIncome'] = data.total_pymnt/data.annual_inc
data['totalPymntInvOverIncome'] = data.total_pymnt_inv/data.annual_inc
data['totalRecPrncpOverIncome'] = data.total_rec_prncp/data.annual_inc
data['totalRecIncOverIncome'] = data.total_rec_int/data.annual_inc

newFeats = ['installmentOverLoanAmnt','loanAmntOverIncome', \
            'revol_balOverIncome','totalPymntOverIncome', \
           'totalPymntInvOverIncome','totalRecPrncpOverIncome', \
            'totalRecIncOverIncome']

For some reason, issues were happening later on. Found out some newly created columns were becoming 'null' aka positive infinity.

In [10]:
print('NaN in numericalFeats: ', np.any(np.isnan(data.loc[:,numericalFeats])))

print('NaN in newFeats: ', np.any(np.isnan(data.loc[:,newFeats])))

data.loc[:,newFeats].max()
NaN in numericalFeats:  False
NaN in newFeats:  True
Out[10]:
installmentOverLoanAmnt    0.04295
loanAmntOverIncome             inf
revol_balOverIncome            inf
totalPymntOverIncome           inf
totalPymntInvOverIncome        inf
totalRecPrncpOverIncome        inf
totalRecIncOverIncome          inf
dtype: float64
In [11]:
print('old data shape: ', data.shape)
with pd.option_context('mode.use_inf_as_na', True):
    # Dropping the rows with nan  
    # (or inf) values 
    data.dropna(inplace=True)
    
print('NaN in numericalFeats: ', np.any(np.isnan(data.loc[:,numericalFeats])))

print('NaN in newFeats: ', np.any(np.isnan(data.loc[:,newFeats])))

print('new data shape: ', data.shape)
old data shape:  (518115, 44)
NaN in numericalFeats:  False
NaN in newFeats:  False
new data shape:  (516108, 44)

Feature Selection and Scaling¶


Features¶

In [12]:
# Select features for training
numericalPlusNewFeats = numericalFeats+newFeats
X_train = data.loc[:,numericalPlusNewFeats]

# Scale data
sX = pp.StandardScaler()
X_train.loc[:,:] = sX.fit_transform(X_train)
In [13]:
X_train.columns
Out[13]:
Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
       'installment', 'emp_length', 'annual_inc', 'dti', 'delinq_2yrs',
       'mths_since_last_delinq', 'mths_since_last_record', 'open_acc',
       'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
       'total_rec_int', 'total_rec_late_fee', 'recoveries',
       'collection_recovery_fee', 'last_pymnt_amnt', 'installmentOverLoanAmnt',
       'loanAmntOverIncome', 'revol_balOverIncome', 'totalPymntOverIncome',
       'totalPymntInvOverIncome', 'totalRecPrncpOverIncome',
       'totalRecIncOverIncome'],
      dtype='object')

Labels¶

In [14]:
data.grade.value_counts(dropna=False)
Out[14]:
A    165175
B    150906
C    121873
D     74726
E      3375
F        36
G        17
Name: grade, dtype: int64
In [15]:
# Designate labels for evaluation
labels = data.grade
labels.unique()
Out[15]:
array(['C', 'A', 'D', 'B', 'E', 'G', 'F'], dtype=object)
In [16]:
# Fill missing labels if there are any
labels = labels.fillna(value="Z")

# Convert labels to numerical values
lbl = pp.LabelEncoder()
lbl.fit(list(labels.values))
labels = pd.Series(data=lbl.transform(labels.values), name='grade')

# Store as y_train
y_train = labels
In [17]:
# View the changes
labelsOriginalVSNew = pd.concat([labels, data.grade],axis=1)
labelsOriginalVSNew.head(5)
Out[17]:
grade grade
0 2.0 C
1 0.0 A
2 2.0 C
3 2.0 C
4 3.0 D
In [18]:
# Compare loan grades with interest rates
interestAndGrade = pd.DataFrame(data=[data.int_rate,labels])
interestAndGrade = interestAndGrade.T

interestAndGrade.groupby("grade").mean()
Out[18]:
int_rate
grade
0.0 1285.583177
1.0 1286.537482
2.0 1288.398079
3.0 1290.670952
4.0 1243.752083
5.0 1351.028571
6.0 1209.294118

Now that we have X_train with all 34 numberical features and y_train with numerical loan grades (used only to validate the results).

Feature Correlation¶

In [19]:
mask = np.triu(X_train[list(X_train.columns)].corr())
plt.figure(figsize=(25,16))
sns.heatmap(X_train[list(X_train.columns)].corr(), annot=True, fmt='.2g', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=mask)
# plt.ylim(18, 0)
plt.title('Features Correlation Matrix Heatmap', size=16)
Out[19]:
Text(0.5, 1.0, 'Features Correlation Matrix Heatmap')

Observation¶

We can see many features are correlated linearly, such as:

  • loan_amnt, funded_amnt, funded_amnt_inv, installment.
  • pub_rec, mths_since_last_record
  • out_prncp_inv, out_prncp
  • total_rec_prncp, total_pymnt, total_pymnt_inv
  • recoveries, collection_recovery-fee
  • loanAmntOverIncome, revol_balOverIncome, totalPymntOverIncome, totalPymntInvOverIncome, totalRecPrncpOverIncome, totalRecIncOverIncome
  • installmentOverLoanAmnt, term

I'll try removing these and see if we can get a streamlined set of features.

In [21]:
X_train = X_train.drop(['funded_amnt', 'funded_amnt_inv', 'installment'], axis=1) # loan_amnt
X_train = X_train.drop(['mths_since_last_record'], axis=1) # pub_rec
X_train = X_train.drop(['out_prncp_inv'], axis=1) # out_prncp
X_train = X_train.drop(['total_pymnt', 'total_pymnt_inv'], axis=1) # total_rec_prncp
X_train = X_train.drop(['collection_recovery_fee'], axis=1) # recoveries
X_train = X_train.drop(['revol_balOverIncome', 'totalPymntOverIncome', 'totalPymntInvOverIncome', 'totalRecPrncpOverIncome', 'totalRecIncOverIncome'], axis=1) # loanAmntOverIncome
X_train = X_train.drop(['installmentOverLoanAmnt'], axis=1) # term

New Feature Correlation¶

In [22]:
mask = np.triu(X_train[list(X_train.columns)].corr())
plt.figure(figsize=(25,16))
sns.heatmap(X_train[list(X_train.columns)].corr(), annot=True, fmt='.2g', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=mask)
# plt.ylim(18, 0)
plt.title('Features Correlation Matrix Heatmap', size=16)
Out[22]:
Text(0.5, 1.0, 'Features Correlation Matrix Heatmap')

PCA Features¶

In [24]:
pca_ = PCA().fit(X_train)
X_pca = PCA().fit_transform(X_train)
In [25]:
pca_evr = pca_.explained_variance_ratio_
cumsum_ = np.cumsum(pca_evr)

# Get dimensions where var >= 95%
dim_95 = np.argmax(cumsum_ >= 0.95) + 1
instances_, dims_ = X_train.shape

# check shape of X
if dims_ > instances_:
    print("WARNING: number of features greater than number of instances.")
    dimensions = list(range(1, instances_+1))
else:
    dimensions = list(range(1, dims_+1))
    
# Print report
print(f"You can reduce from {dims_} to {dim_95} dimensions while retaining 95% of variance.")
You can reduce from 20 to 16 dimensions while retaining 95% of variance.
In [26]:
pca_16 = PCA(n_components=16).fit(X_train)
X_pca_16 = PCA(n_components=16).fit_transform(X_train)

PCAdf = pd.DataFrame(pca_16.components_, columns=list(X_train.columns))

# Load the example flights dataset and convert to long-form
feat_cor = PCAdf.T

# Draw a heatmap with the numeric values in each cell
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(feat_cor, annot=True, linewidths=.5, ax=ax, cmap='coolwarm')
plt.title('PCA Dimension Components Matrix Heatmap', size=16)
Out[26]:
Text(0.5, 1.0, 'PCA Dimension Components Matrix Heatmap')
In [28]:
feat_cor.apply(lambda x: print(x.sort_values(ascending=False).head(5)), axis=0)
loan_amnt        0.491383
total_rec_int    0.447772
out_prncp        0.412221
term             0.296014
revol_bal        0.284829
Name: 0, dtype: float64
last_pymnt_amnt    0.560076
total_rec_prncp    0.542091
total_acc          0.246680
open_acc           0.213005
annual_inc         0.114309
Name: 1, dtype: float64
open_acc                  0.551213
total_acc                 0.538018
dti                       0.113655
revol_bal                 0.108285
mths_since_last_delinq    0.103607
Name: 2, dtype: float64
annual_inc            0.276965
out_prncp             0.267413
loan_amnt             0.194992
revol_bal             0.140314
loanAmntOverIncome   -0.002490
Name: 3, dtype: float64
revol_util     0.624533
revol_bal      0.477545
annual_inc     0.268811
delinq_2yrs    0.141519
dti            0.109721
Name: 4, dtype: float64
delinq_2yrs               0.488972
annual_inc                0.374840
total_rec_late_fee        0.334446
recoveries                0.219888
mths_since_last_delinq    0.197932
Name: 5, dtype: float64
delinq_2yrs           0.369213
pub_rec               0.244885
recoveries            0.203391
total_rec_late_fee    0.178701
dti                   0.120163
Name: 6, dtype: float64
recoveries                0.667965
total_rec_late_fee        0.368561
pub_rec                   0.259649
mths_since_last_delinq    0.107204
revol_bal                 0.105065
Name: 7, dtype: float64
pub_rec               0.580689
emp_length            0.574274
annual_inc            0.249403
revol_util            0.212957
loanAmntOverIncome    0.157230
Name: 8, dtype: float64
loanAmntOverIncome        0.983359
mths_since_last_delinq    0.051820
total_rec_late_fee        0.051457
delinq_2yrs               0.040067
dti                       0.023447
Name: 9, dtype: float64
recoveries            0.582447
term                  0.142224
delinq_2yrs           0.085341
loanAmntOverIncome    0.047750
last_pymnt_amnt       0.032539
Name: 10, dtype: float64
emp_length            0.762597
total_rec_late_fee    0.222641
dti                   0.188678
recoveries            0.181077
loan_amnt             0.058533
Name: 11, dtype: float64
delinq_2yrs               0.447137
mths_since_last_delinq    0.414639
pub_rec                   0.403159
dti                       0.304807
loan_amnt                 0.208964
Name: 12, dtype: float64
annual_inc    0.688497
dti           0.568426
term          0.097892
int_rate      0.083257
emp_length    0.036538
Name: 13, dtype: float64
term                      0.732542
revol_bal                 0.304663
delinq_2yrs               0.196178
mths_since_last_delinq    0.135775
last_pymnt_amnt           0.122975
Name: 14, dtype: float64
revol_bal                 0.692853
int_rate                  0.254645
mths_since_last_delinq    0.129626
total_rec_int             0.088147
delinq_2yrs               0.061726
Name: 15, dtype: float64
Out[28]:
0     None
1     None
2     None
3     None
4     None
5     None
6     None
7     None
8     None
9     None
10    None
11    None
12    None
13    None
14    None
15    None
dtype: object

By running PCA, the cumsum shows that we can retain 95% of the variance with only 16 dimensions rather than 21. So by running PCA again with only 16 dimensions and checking on the components of each dimension, we can see what features are connected. Here's what we can see:

Dim0¶

  • loan_amnt 0.49
  • total_rec_int 0.44
  • out_prncp 0.41

Dim1¶

  • last_pymnt_amnt 0.56
  • total_rec_prncp 0.54

Dim2¶

  • open_acc 0.55
  • total_acc 0.53

Dim3¶

  • annual_inc 0.27
  • out_prncp 0.27

Dim4¶

  • revol_util 0.62
  • revol_bal 0.47

Dim5¶

  • delinq_2yrs 0.49
  • annual_inc 0.37
  • total_rec_late_fee 0.33

Dim6¶

  • delinq_2yrs 0.37
  • pub_rec 0.24
  • recoveries 0.20

Dim7¶

  • recoveries 0.67
  • total_rec_late_fee 0.37
  • pub_rec 0.26

Dim8¶

  • pub_rec 0.58
  • emp_length 0.57

Dim9¶

  • loanAmntOverIncome 0.98

Dim10¶

  • recoveries 0.58

Dim11¶

  • emp_length 0.76

Dim12¶

  • delinq_2yrs
  • mths_since_last_delinq
  • pub_rec

Dim13¶

  • annual_inc 0.69
  • dti 0.57

Dim14¶

  • term 0.73
  • revol_bal 0.30

Dim15¶

  • revol_bal 0.69
  • int_rate 0.25

When we run models, these will be the connections I will check out.

Sampling 10,000¶

In [29]:
# Random sample of 10,000 out of ~510,000
np.random.seed(123)

indices = np.random.choice(data.shape[0],10000)
X = X_train.iloc[indices] / 255.0
y = y_train.iloc[indices]

X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

print(X.shape, y.shape)
(10000, 20) (10000,)

Analyzing Clusters¶


The clusters we want should be filled with borrowers of similar numerical loan grade, which we will validate using the numerical loan grades we set aside in y. The higher the percentage of borrowers that have the most frequently occuring numerical loan grade in each and every cluster, the better the clustering application.

For example, consider a cluster with 100 borrowers. If 30 borrowers have the numerical grade of 0, 25 borrowers have a loan grade of 1, 20 borrowers have a loan grade of 2, and the remaining borrowers have loan grades ranging from 3 to 7, we would say that the cluster has a 30% accuracy, given that the most frequently occuring lonan grade for that cluster applies to just 30% of the borrowers in that cluster.

To analyze future clusters:

In [30]:
def analyzeCluster(clusterDF, labelsDF):
    countByCluster = pd.DataFrame(data=clusterDF['cluster'].value_counts())
    countByCluster.reset_index(inplace=True, drop=False)
    countByCluster.columns = ['cluster', 'clusterCount']
    
    preds = pd.concat([labelsDF, clusterDF], axis=1)
    preds.columns = ['trueLabel', 'cluster']
    
    countByLabel = pd.DataFrame(data=preds.groupby('trueLabel').count())
    
    countMostFreq = pd.DataFrame(data=preds.groupby('cluster').agg(lambda x : x.value_counts().iloc[0]))
    countMostFreq.reset_index(inplace=True, drop=False)
    countMostFreq.columns = ['cluster', 'countMostFrequent']
    
    accuracyDF = countMostFreq.merge(countByCluster, left_on='cluster', right_on='cluster')
    
    overallAccuracy = accuracyDF.countMostFrequent.sum()/accuracyDF.clusterCount.sum()
    
    accuracyByLabel = accuracyDF.countMostFrequent/accuracyDF.clusterCount
    
    return countByCluster, countByLabel, countMostFreq, accuracyDF, overallAccuracy, accuracyByLabel

Models¶

K-Means¶

In [31]:
n_clusters = 10
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 123
cluster_range = range(2,36)

kmeans = KMeans(n_clusters=n_clusters,
                n_init=n_init,
                max_iter=max_iter,
                tol=tol,
                random_state=random_state)
# Inertia
kMeans_inertia = pd.DataFrame(data=[],
                              index=cluster_range,
                              columns=['inertia'])
# Overall Accuracy
overallAccuracy_kMeansDF = pd.DataFrame(data=[],
                                        index=cluster_range,
                                        columns=['overallAccuracy'])
# ARI
ari_KMeansDF = pd.DataFrame(data=[],
                            index=cluster_range,
                            columns=['ari'])
# Silhouette Score
ss_KMeansDF = pd.DataFrame(data=[],
                            index=cluster_range,
                            columns=['ss'])
In [32]:
for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters,
                    n_init=n_init,
                    max_iter=max_iter,
                    tol=tol,
                    random_state=random_state)
    kmeans.fit(X)
    kMeans_inertia.loc[n_clusters] = kmeans.inertia_
    cluster = kmeans.predict(X)
    cluster_df = pd.DataFrame(data=cluster,
                         index=X.index,
                         columns=['cluster'])
    
    countByCluster_kMeans, countByLabel_kMeans, \
    countMostFreq_kMeans, accuracyDF_kMeans, \
    overallAccuracy_kMeans, accuracyByLabel_kMeans = \
    analyzeCluster(cluster_df, y)
    
    overallAccuracy_kMeansDF.loc[n_clusters] = overallAccuracy_kMeans
    
    ari_KMeansDF.loc[n_clusters] = adjusted_rand_score(y, cluster)
    ss_KMeansDF.loc[n_clusters] = silhouette_score(X, cluster, metric='euclidean')
In [33]:
overallAccuracy_kMeansDF.plot()
Out[33]:
<AxesSubplot:>
In [34]:
kMeans_inertia.plot()
Out[34]:
<AxesSubplot:>
In [35]:
ari_KMeansDF.plot()
Out[35]:
<AxesSubplot:>
In [36]:
ss_KMeansDF.plot()
Out[36]:
<AxesSubplot:>
In [37]:
accuracyByLabel_kMeans
Out[37]:
0     0.417886
1     0.356923
2     0.360938
3     0.750000
4     0.503448
5     0.404110
6     0.624506
7     0.718053
8     0.307692
9     0.333333
10    0.529412
11    0.401786
12    0.410072
13    1.000000
14    0.387097
15    0.527421
16    0.480000
17    0.500000
18    0.550000
19    0.342193
20    0.588477
21    0.478873
22    0.617357
23    0.505882
24    0.294821
25    0.273684
26    0.344262
27    0.664286
28    0.394737
29    0.368280
30    0.342342
31    0.380952
32    0.400000
33    0.399390
34    0.457082
dtype: float64

As we can see from overallAccuracy, the accuracy is best around 30 clusters and levels out there at aproximately 49% accuracy.

Inertia drops very gradually and consistently so there's never really an elbow to determine a clear number of clusters.

ARI shows a huge spike around 13-14 clusters. Silhoette Score shows a drop around 7 then flattens out, although the difference is not as dramatic.

By focusing on ARI and acknowledging the other evaluations, let us try 13 clusters.

KMeans n_components=13¶

In [38]:
n_clusters = 13
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 123

kmeans = KMeans(n_clusters=n_clusters,
                n_init=n_init,
                max_iter=max_iter,
                tol=tol,
                random_state=random_state)

kmeans.fit(X)
kMeans_inertia.loc[n_clusters] = kmeans.inertia_
cluster = kmeans.predict(X)
cluster_df = pd.DataFrame(data=cluster,
                     index=X.index,
                     columns=['cluster'])

countByCluster_kMeans, countByLabel_kMeans, \
countMostFreq_kMeans, accuracyDF_kMeans, \
overallAccuracy_kMeans, accuracyByLabel_kMeans = \
analyzeCluster(cluster_df, y)

print('overallAccuracy_kMeans: ', overallAccuracy_kMeans)
print('ARI: ', adjusted_rand_score(y, cluster))
print('Silhouette Score: ', silhouette_score(X, cluster, metric='euclidean'))
overallAccuracy_kMeans:  0.4524
ARI:  0.09279624324041397
Silhouette Score:  0.11465279039276895
In [39]:
accuracyByLabel_kMeans
Out[39]:
0     0.314159
1     0.400445
2     0.382230
3     0.386991
4     0.750000
5     0.526316
6     0.447368
7     0.617886
8     1.000000
9     0.371113
10    0.610671
11    0.400000
12    0.291498
dtype: float64
In [40]:
accuracyDF_kMeans
Out[40]:
cluster countMostFrequent clusterCount
0 0 284 904
1 1 360 899
2 2 641 1677
3 3 589 1522
4 4 3 4
5 5 40 76
6 6 17 38
7 7 608 984
8 8 1 1
9 9 370 997
10 10 1465 2399
11 11 2 5
12 12 144 494

We see why certain clusters were 100% accurate: they're cluster-size of 1.

Starting point to build a clustering application to automatically assign new borrowers that apply for a Lending Club loan into a pre-existing group based on how similar they are to the other borrowers.

In [41]:
cluster = kmeans.predict(X)
cluster_df = pd.DataFrame(data=cluster,
                     index=X.index,
                     columns=['cluster'])
In [42]:
kmeans_df = X
kmeans_df['cluster'] = cluster
In [43]:
len(kmeans_df.columns)
Out[43]:
21
Plotting every feature of every cluster¶
In [44]:
def plot_cluster(y, x='cluster', data=kmeans_df, ax=ax):
    sns.stripplot(data=data, x=x, y=y, ax=ax)
    sns.boxplot(data=data, x=x, y=y, ax=ax)
    ax.set_xlabel(x)
    ax.set_ylabel(y);
    ax.set_title(y, size=16)
    
fig, axes = plt.subplots(20, 1, figsize=(10,105))
for i in range(len(axes)):
    plot_cluster(y=kmeans_df.columns[i], ax=axes[i])

Grouping¶

Cluster 0¶

  • Cluster Size: 904
  • HIGH pub_rec (Number of derogatory public records, most likely bankrupcy filing)
  • low loan_amnt (Listed amount of the loan applied for by the borrower)
  • moderately low int_rate (Interest Rate on the loan)

The cluster is for people who filed for bankrupcy multiple times, thus generally low loan amounts and low interest rates.

In [45]:
dim_cols = ['pub_rec', 'loan_amnt', 'int_rate']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 1¶

  • Cluster Size 899
  • HIGH total_rec_int (Interest received to date)
  • high loan_amnt (listed amount of the loan applied for by the borrower)
  • high total_rec_late_fee (Late fees received to date)

The cluster is for people who pays a lot of interest, essentially due to high loans amount and late payments.

In [46]:
dim_cols = ['total_rec_int', 'loan_amnt', 'total_rec_late_fee']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 2¶

  • Cluster Size: 1677
  • Higher than Average revol_util (Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.)
  • moderately high delinq_2_yrs (The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years)
  • low loan_amnt (listed amount of the loan applied for by the borrower)
  • low total_rec_int (Late fees received to date)

This cluster consists of users who use up a higher percentage of their credit limit (revolving line utilization rate), increasing their chances of delinquency (non-payment count), thus only being accepted for lower loan amounts, amounting to less interest accumulation, despite their increased chances of missing payments.

In [47]:
dim_cols = ['revol_util', 'delinq_2yrs', 'loan_amnt', 'total_rec_int']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 3¶

  • Cluster Size: 1522
  • Average loan_amnt (listed amount of the loan applied for by the borrower)
  • Presense of delinq_2yrs (The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years)
  • Presense of total_rec_late_fee (Late fees received to date)

This cluster, normal borrowers who occasionally miss payments. They are overall average with every category, gets average loan amount, but occasionally misses payments thus has slightly high fees.

In [48]:
dim_cols = ['loan_amnt', 'delinq_2yrs', 'total_rec_late_fee']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 4:¶

  • Cluster Size: 4
  • HIGH annual_inc (The self-reported annual income provided by the borrower during registration.)
  • low open_acc (The number of open credit lines in the borrower's credit file.)
  • low total_acc (The total number of credit lines currently in the borrower's credit file)
  • low delinq_2yrs (The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years)
  • low mths_since_last_delinq (The number of months since the borrower's last delinquency.)

This cluster of 4 has insane amount of income. Surprisingly, they are also very good with their money as they open the least number of accounts and never miss payments.

In [49]:
dim_cols = ['annual_inc', 'open_acc', 'total_acc', 'delinq_2yrs', 'mths_since_last_delinq']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 5¶

  • Cluster Size: 76
  • High revol_bal (Total credit revolving balance)
  • low int_rate (Interest Rate on the loan)

This cluster has a lot of total credit revolvinng balance and on average, a low interest rate.

In [50]:
dim_cols = ['revol_bal', 'int_rate']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 6¶

  • Cluster Size: 38
  • High recoveries (Post charge off gross recovery)
  • Highest int_rate (Interest Rate on the loan)

This cluster has a lot of post charge-off recoveries and abnormally high interest rates.

In [51]:
dim_cols = ['recoveries', 'int_rate']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 7¶

  • Cluster Size: 984
  • high total_rec_late_fee (Late fees received to date)
  • slightly-higher out_prncp (Remaining outstanding principal for total amount funded)
  • slightly higher total_rec_int (Interest received to date)

This cluster is similar to Cluster 1, but without the high-interest payment, meaning they keep their insterest minimal, but have a lot of loan left to pay off.

In [52]:
dim_cols = ['total_rec_late_fee', 'out_prncp', 'total_rec_int']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 8¶

  • Cluster Size: 1
  • REALLY HIGH recoveries (post charge off gross recovery)
  • Lowest emp_length (Employment length in years. Possible values are between 0 and 10)

The one customer who got a massive previously charged-off loan recovered from someone who is not employed.

In [53]:
dim_cols = ['recoveries', 'emp_length']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 9¶

  • Cluster Size: 997
  • High open_acc (The number of open credit lines in the borrower's credit file.)
  • High total_acc (The total number of credit lines currently in the borrower's credit file)

This cluster has abnormal amount of open accounts.

In [54]:
dim_cols = ['open_acc', 'total_acc']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 10¶

  • Cluster Size: 2399
  • low loan_amnt (The listed amount of the loan applied for by the borrower.)
  • low int_rate (Interest Rate on the loan)
  • tends to be 3-year term (The number of payments on the loan. Values are in months and can be either 36 or 60.)
  • slightly more delinq_2yrs (The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years)

This cluster takes really low loan amounts with low interest rates with low term-rate.

In [55]:
dim_cols = ['loan_amnt', 'int_rate', 'term', 'delinq_2yrs']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 11¶

  • Cluster Size: 5
  • HIGH dti ("Debt-to-Income": A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.)
  • low annual_inc (The self-reported annual income provided by the borrower during registration.)

This cluster of 5 has the highest DTI (Debt-to-Income) ratio, unsurprisingly with abysmal annual income.

In [56]:
dim_cols = ['dti', 'annual_inc']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Cluster 12¶

  • Cluster Size: 494
  • HIGH total_rec_prncp (Principal received to date)
  • HIGH last_pymnt_amnt (Last total payment amount received)

This cluster gets higher-than average loans and pays off the principle quickly, with lump sum on final payment.

In [57]:
dim_cols = ['loan_amnt', 'total_rec_prncp', 'last_pymnt_amnt']
    
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
    plot_cluster(y=dim_cols[i], ax=axes[i])

Hierarchical Clustering¶

In [59]:
Z = fastcluster.linkage_vector(X, method='ward', metric='euclidean')
Z_df = pd.DataFrame(data=Z, columns=['clusterOne', 
                                    'clusterTwo', 
                                    'distance', 
                                    'newClusterSize'])
In [60]:
distance_threshold = 0.175
clusters = fcluster(Z, distance_threshold, criterion = 'distance')
X_hierClustered = pd.DataFrame(data=clusters,
                               index=X_PCA_df.index,
                               columns=['cluster'])
print("Number of distinct clusters: ", len(X_hierClustered['cluster'].unique()))
Number of distinct clusters:  22
In [61]:
countByCluster_hierClust, countByLabel_hierClust, countMostFreq_hierClust, \
accuracyDF_hierClust, overallAccuracy_hierClust, accuracyByLabel_hierClust = \
analyzeCluster(X_hierClustered, y)
print('Overall accuracy from hierarchical clustering: ', overallAccuracy_hierClust)
print('ARI: ', adjusted_rand_score(y, clusters))
print('Silhouette Score: ', silhouette_score(X, clusters, metric='euclidean'))
Overall accuracy from hierarchical clustering:  0.4608
ARI:  0.07169597463439129
Silhouette Score:  0.3114058391693978
In [62]:
accuracyByLabel_hierClust
Out[62]:
0     0.314159
1     0.400455
2     0.400000
3     0.382586
4     0.388451
5     0.431373
6     0.403800
7     0.370017
8     0.500000
9     0.619342
10    1.000000
11    0.447368
12    0.750000
13    0.526316
14    0.400000
15    0.373418
16    0.273810
17    0.387097
18    0.468563
19    0.409396
20    0.529900
21    0.637730
dtype: float64

Again, we can see percentage ranges from 27% to 100%, with the 100% because it's a cluster-size of 1.

In [63]:
accuracyDF_hierClust
Out[63]:
cluster countMostFrequent clusterCount
0 1 284 904
1 2 352 879
2 3 8 20
3 4 145 379
4 5 444 1143
5 6 44 102
6 7 170 421
7 8 427 1154
8 9 6 12
9 10 602 972
10 11 1 1
11 12 17 38
12 13 3 4
13 14 40 76
14 15 2 5
15 16 59 158
16 17 92 336
17 18 12 31
18 19 313 668
19 20 122 298
20 21 319 602
21 22 1146 1797
In [64]:
hier_df = X
hier_df['cluster'] = clusters

fig, axes = plt.subplots(20, 1, figsize=(10,105))
for i in range(len(axes)):
    plot_cluster(y=hier_df.columns[i], ax=axes[i], data=hier_df)

HDBSCAN Application¶

In [65]:
min_cluster_size = 20
min_samples = 20
alpha = 1.0
cluster_selection_method = 'leaf'

hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                      min_samples=min_samples,
                      alpha=alpha,
                      cluster_selection_method=cluster_selection_method)

X_hdbscanClustered = hdb.fit_predict(X)
X_hdbscanClustered_df = pd.DataFrame(data=X_hdbscanClustered,
                                     index=X.index,
                                     columns=['cluster'])
countByCluster_hdbscan, countByLabel_hdbscan, \
countMostFreq_hdbscan, accuracyDF_hdbscan, \
overallAccuracy_hdbscan, accuracyByLabel_hdbscan = \
analyzeCluster(X_hdbscanClustered_df, y)

print('Overall accuracy from hierarchical clustering: ', overallAccuracy_hierClust)
print('ARI: ', adjusted_rand_score(y, X_hdbscanClustered))
print('Silhouette Score: ', silhouette_score(X, X_hdbscanClustered, metric='euclidean'))
Overall accuracy from hierarchical clustering:  0.4608
ARI:  0.061268351488627436
Silhouette Score:  0.6342216209546551
In [66]:
accuracyByLabel_hdbscan
Out[66]:
0     0.336927
1     0.447368
2     0.519481
3     0.373418
4     0.273810
5     0.529900
6     0.637730
7     0.409396
8     0.387097
9     0.468563
10    0.431373
11    0.388451
12    0.382586
13    0.403800
14    0.370017
15    0.500000
16    0.382311
17    0.469880
18    0.463415
19    0.742857
20    0.711864
dtype: float64
In [67]:
accuracyDF_hdbscan
Out[67]:
cluster countMostFrequent clusterCount
0 -1 375 1113
1 0 17 38
2 1 40 77
3 2 59 158
4 3 92 336
5 4 319 602
6 5 1146 1797
7 6 122 298
8 7 12 31
9 8 313 668
10 9 44 102
11 10 444 1143
12 11 145 379
13 12 170 421
14 13 427 1154
15 14 15 30
16 15 268 701
17 16 39 83
18 17 190 410
19 18 78 105
20 19 252 354

From the cluster dataframe, we can see almost a quarter of the data was not grouped at all. ARI and Silhouette Scores are a smidgen higher compared to previous methods. Pretty sure results can be improved.

In [68]:
hdb_df = X
hdb_df['cluster'] = X_hdbscanClustered

fig, axes = plt.subplots(20, 1, figsize=(10,105))
for i in range(len(axes)):
    plot_cluster(y=hier_df.columns[i], ax=axes[i], data=hdb_df)

Visualizations¶

3D¶

I chose 3D as 2D plots felt very restrictive.

In [69]:
# Color by Cluster
def dim_reduction_df(proj_3d, cluster_array):
    df = pd.DataFrame(data=proj_3d, columns=['x', 'y', 'z'])
    df['cluster'] = cluster_array
    return df

def color_by_cluster(cluster_df, proj_3d, cluster_name, dim_red_name):
    PLOT = go.Figure()
    df = dim_reduction_df(proj_3d, cluster_array=cluster_df['cluster'])
    for C in list(cluster_df.cluster.unique()):
        PLOT.add_trace(go.Scatter3d(x = df[df.cluster == C]['x'],
                                    y = df[df.cluster == C]['y'],
                                    z = df[df.cluster == C]['z'],
                                    mode = 'markers',
                                    marker_size = 3,
                                    marker_line_width = 1,
                                    name = 'Cluster ' + str(C)))

    PLOT.update_layout(autosize = True,
                      showlegend = True,
                       title = '{} with {}'.format(cluster_name, dim_red_name),
                      scene = dict(xaxis=dict(title = 'x'),
                                  yaxis=dict(title = 'y'),
                                  zaxis=dict(title = 'z')),
                      font = dict(family = 'Gilroy',
                                 color = 'black',
                                 size = 12))
    return PLOT

PCA¶

In [70]:
Xpca_3 = PCA(n_components=3, random_state=123).fit_transform(X)

#3D Plot of PCA
chart3d = go.Scatter3d(
    x = Xpca_3[:,0],# First Column
    y = Xpca_3[:,1],
    z = Xpca_3[:,2],
    mode='markers',
    marker=dict(
        color = y,
        size = 5,
        line=dict(color = y),
        opacity = 0.7
        )
    )
layout = go.Layout(
    title = 'Dimensionality Reduction with PCA',
    scene = dict(
        xaxis = dict(title = 'Dim 1'),
        yaxis = dict(title = 'Dim 2'),
        zaxis = dict(title = 'Dim 3')
        )
    )
In [71]:
# Show the chart
fig = go.Figure(data=chart3d, layout=layout)
py.offline.iplot(fig)

PCA colored with KMeans clusters¶

In [72]:
# PCA colored with Kmeans
color_by_cluster(kmeans_df, Xpca_3, 'Kmeans', 'PCA')

PCA colored with Hierarchical clusters¶

In [73]:
color_by_cluster(hier_df, Xpca_3, 'Heirarchical', 'PCA')

PCA colored with HDBScan clustering¶

In [74]:
color_by_cluster(hdb_df, Xpca_3, 'HDBScan', 'PCA')

TSNE¶

In [75]:
time_start = time.time()
Xtsne_3 = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=300, random_state=123).fit_transform(X)
print('t-SNE done!  Time elapsed: {} seconds'.format(time.time()-time_start))
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 10000 samples in 0.062s...
[t-SNE] Computed neighbors for 10000 samples in 0.823s...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.003668
[t-SNE] KL divergence after 250 iterations with early exaggeration: 66.589195
[t-SNE] KL divergence after 300 iterations: 2.185263
t-SNE done!  Time elapsed: 34.56090807914734 seconds
In [76]:
chart3d_tsne = go.Scatter3d(
    x = Xtsne_3[:,0],# First Column
    y = Xtsne_3[:,1],
    z = Xtsne_3[:,2],
    mode='markers',
    marker=dict(
        color = y,
        size = 5,
        line=dict(color = y),
        opacity = 0.7
        )
    )
layout = go.Layout(
    title = 'Dimensionality Reduction with TSNE',
    scene = dict(
        xaxis = dict(title = 'Dim 1'),
        yaxis = dict(title = 'Dim 2'),
        zaxis = dict(title = 'Dim 3')
        )
    )
In [77]:
# Show the chart
fig = go.Figure(data=chart3d_tsne, layout=layout)
py.offline.iplot(fig)

TSNE colored with KMeans clusters¶

In [78]:
color_by_cluster(kmeans_df, Xtsne_3, 'KMeans', 'TSNE')

TSNE colored with Hierarchical Clusters¶

In [79]:
color_by_cluster(hier_df, Xtsne_3, 'Hierarchical', 'TSNE')

TSNE Colored with HDB Clusters¶

In [80]:
color_by_cluster(hdb_df, Xtsne_3, 'HDBScan', 'TSNE')

UMAP¶

In [81]:
umap_3d = UMAP(n_components=3, init='random', random_state=123)

proj_3d = umap_3d.fit_transform(X)
In [82]:
fig_3d = px.scatter_3d(
    proj_3d, x=0, y=1, z=2,
    color=y, labels={'color': 'Grade'}
)
fig_3d.update_traces(marker_size=3)
In [83]:
fig_3d.show()
In [84]:
color_by_cluster(kmeans_df, proj_3d, 'KMeans', 'UMAP')
In [85]:
color_by_cluster(hier_df, proj_3d, 'Hierarchical', 'UMAP')
In [86]:
color_by_cluster(hdb_df, proj_3d, 'HDBScan', 'UMAP')

Based on the complexity of the data, an extra dimension goes a long way. Out of all the visualizations, I prefer UMAP 3D as it does a better job of grouping datapoints, not necessairly by grade, but by actual clusters.

Conclusion¶

Kmeans - 12 clusters¶

  • Overall - 0.45
  • ARI - 0.093
  • SS - 0.11

Hierarchical - 22 clusters¶

  • Overall - 0.46
  • ARI - 0.072
  • SS - 0.31

HDBSCAN - 19 clusters (+1 non-cluster)¶

  • Overall - 0.46
  • ARI - 0.061
  • SS - 0.63

From the visual 3D graphs, most of the clustering makes sense. Compared to the ARI and SS evaluations, HDBScan seemed to perform the best.